INTRODUCTION

In this article, the author analyzes the relationship between college majors and earnings potential, using data from the American Community Survey (ACS) and the U.S. Department of Education’s College Scorecard. The article emphasizes the importance of considering the cost of education in relation to potential earnings, as well as the long-term outlook for job growth in various fields. The article provides a link to the github repository for the data sources but does not have much code or any graphs (just some aggregation). I will be conducting some steps for data wrangling as well as producing some graphs that show helpful analysis about college majors 9 years back.

ANALYSIS

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   1.0.1
## ✔ tibble  3.1.8     ✔ dplyr   1.1.0
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ggthemes)
library(scales)
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

Loading Data

recent_grads <- read.csv("recent-grads.csv")
head(recent_grads)
##   Rank Major_code                                     Major Total   Men Women
## 1    1       2419                     PETROLEUM ENGINEERING  2339  2057   282
## 2    2       2416            MINING AND MINERAL ENGINEERING   756   679    77
## 3    3       2415                 METALLURGICAL ENGINEERING   856   725   131
## 4    4       2417 NAVAL ARCHITECTURE AND MARINE ENGINEERING  1258  1123   135
## 5    5       2405                      CHEMICAL ENGINEERING 32260 21239 11021
## 6    6       2418                       NUCLEAR ENGINEERING  2573  2200   373
##   Major_category ShareWomen Sample_size Employed Full_time Part_time
## 1    Engineering  0.1205643          36     1976      1849       270
## 2    Engineering  0.1018519           7      640       556       170
## 3    Engineering  0.1530374           3      648       558       133
## 4    Engineering  0.1073132          16      758      1069       150
## 5    Engineering  0.3416305         289    25694     23170      5180
## 6    Engineering  0.1449670          17     1857      2038       264
##   Full_time_year_round Unemployed Unemployment_rate Median P25th  P75th
## 1                 1207         37        0.01838053 110000 95000 125000
## 2                  388         85        0.11724138  75000 55000  90000
## 3                  340         16        0.02409639  73000 50000 105000
## 4                  692         40        0.05012531  70000 43000  80000
## 5                16697       1672        0.06109771  65000 50000  75000
## 6                 1449        400        0.17722641  65000 50000 102000
##   College_jobs Non_college_jobs Low_wage_jobs
## 1         1534              364           193
## 2          350              257            50
## 3          456              176             0
## 4          529              102             0
## 5        18314             4440           972
## 6         1142              657           244
ggplot(recent_grads, aes(Median)) +
  geom_histogram(color ="black", fill = 'pink')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

> We can see that most of the majors make around 30k / year, I assume this data is 9 years ago, therefore the salary is a bit low compared to today. It can even be that this salary is right out of college too. There are also some majors’ salary are very high too which is close to 100k or more.

I’m making a new df to for capitalize the Major column and following Median column order.

majors_processed <- recent_grads %>%
  arrange(desc(Median)) %>%
  mutate(Major = str_to_title(Major),
         Major = fct_reorder(Major, Median))

head(majors_processed)
##   Rank Major_code                                     Major Total   Men Women
## 1    1       2419                     Petroleum Engineering  2339  2057   282
## 2    2       2416            Mining And Mineral Engineering   756   679    77
## 3    3       2415                 Metallurgical Engineering   856   725   131
## 4    4       2417 Naval Architecture And Marine Engineering  1258  1123   135
## 5    5       2405                      Chemical Engineering 32260 21239 11021
## 6    6       2418                       Nuclear Engineering  2573  2200   373
##   Major_category ShareWomen Sample_size Employed Full_time Part_time
## 1    Engineering  0.1205643          36     1976      1849       270
## 2    Engineering  0.1018519           7      640       556       170
## 3    Engineering  0.1530374           3      648       558       133
## 4    Engineering  0.1073132          16      758      1069       150
## 5    Engineering  0.3416305         289    25694     23170      5180
## 6    Engineering  0.1449670          17     1857      2038       264
##   Full_time_year_round Unemployed Unemployment_rate Median P25th  P75th
## 1                 1207         37        0.01838053 110000 95000 125000
## 2                  388         85        0.11724138  75000 55000  90000
## 3                  340         16        0.02409639  73000 50000 105000
## 4                  692         40        0.05012531  70000 43000  80000
## 5                16697       1672        0.06109771  65000 50000  75000
## 6                 1449        400        0.17722641  65000 50000 102000
##   College_jobs Non_college_jobs Low_wage_jobs
## 1         1534              364           193
## 2          350              257            50
## 3          456              176             0
## 4          529              102             0
## 5        18314             4440           972
## 6         1142              657           244

I want to create a new df to see the Total Observations of each Major_Category:

by_major_category <- majors_processed %>%
  filter(!is.na(Total)) %>%
  group_by(Major_category) %>%
  summarize(Men = sum(Men),
            Women = sum(Women),
            Total = sum(Total),
            College_jobs = sum(College_jobs),
            Non_college_jobs = sum(Non_college_jobs),
            MedianSalary = sum(Median * Sample_size) / sum(Sample_size)) %>%
  mutate(ShareWomen = Women / Total)

head(by_major_category)
## # A tibble: 6 × 8
##   Major_category               Men  Women  Total Colle…¹ Non_c…² Media…³ Share…⁴
##   <chr>                      <int>  <int>  <int>   <int>   <int>   <dbl>   <dbl>
## 1 Agriculture & Natural Re…  40357  35263 7.56e4   18677   33217  35586.   0.466
## 2 Arts                      134390 222740 3.57e5   94785  163720  32046.   0.624
## 3 Biology & Life Science    184919 268943 4.54e5  151233  127182  34379.   0.593
## 4 Business                  667852 634524 1.30e6  148538  496570  40890.   0.487
## 5 Communications & Journal… 131921 260680 3.93e5   86556  172992  34738.   0.664
## 6 Computers & Mathematics   208725  90283 2.99e5  137859   74463  46993.   0.302
## # … with abbreviated variable names ¹​College_jobs, ²​Non_college_jobs,
## #   ³​MedianSalary, ⁴​ShareWomen

Let’s see the correlation between variables within the dataframe we just created.

pairs(by_major_category[,2:8], lower.panel = NULL)

> Looks like there are no significant variables that are corelated with each other.

Let’s see which Major_category makes the most in terms of salary

majors_processed %>%
  group_by(Major_category) %>%
  summarize(Median = median(Median)) %>%
  mutate(Major_category = fct_reorder(Major_category, Median)) %>%
  
  ggplot(aes(Major_category, Median)) +
  geom_col(fill = "lightgreen") +
  scale_y_continuous(labels = dollar_format()) +
  coord_flip()

What are the most common majors? (Add a bit colors)

majors_processed %>%
  mutate(Major = fct_reorder(Major, Total)) %>%
  arrange(desc(Total)) %>%
  head(20) %>%
  
  ggplot(aes(Major, Total, fill = Major_category)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = comma_format()) +
  labs(x = "", y = "Total number of graduates") +
  theme(legend.position = "none")
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Major = fct_reorder(Major, Total)`.
## Caused by warning:
## ! `fct_reorder()` removing 1 missing value.
## ℹ Use `.na_rm = TRUE` to silence this message.
## ℹ Use `.na_rm = FALSE` to preserve NAs.

Highest Earning Major (in detail)?

majors_processed %>%
  filter(Sample_size >= 100) %>%
  head(20) %>%
  ggplot(aes(Major, Median, color = Major_category)) +
  geom_point() +
  geom_errorbar(aes(ymin = P25th, ymax = P75th)) +
  expand_limits(y = 0) +
  scale_y_continuous(labels = dollar_format()) +
  coord_flip() +
  labs(title = "What are the highest-earning majors?",
       subtitle = "Top 20 majors with at least 100 graduates surveyed",
       x = "",
       y = "Median salary")

What are the lowest earning majors (in detail)?

majors_processed %>%
  filter(Sample_size >= 100) %>%
  tail(20) %>%
  ggplot(aes(Major, Median, color = Major_category)) +
  geom_point() +
  geom_errorbar(aes(ymin = P25th, ymax = P75th)) +
  expand_limits(y = 0) +
  coord_flip()

Follow-Up

I added some color and make the graph interactive in terms of Median Salary and also order it make it easier to look and detect outliers.

interactive_plot <- ggplot(majors_processed, aes(Major_category, Median, fill = Major_category)) +
    geom_boxplot() +
    expand_limits(y = 0) +
    coord_flip() +
    theme(legend.position = "none") 

plotly::ggplotly(interactive_plot) 

Another interesting interactive plot that I also want to examine the proportion of genders for each major to see if there is something exciting.

gen_inter <-  majors_processed %>%
    arrange(desc(Total)) %>%
    head(20) %>%
    mutate(Major = fct_reorder(Major, Total)) %>%
    gather(Gender, Number, Men, Women) %>%
    
    ggplot(aes(Major, Number, fill = Gender)) +
    geom_col() +
    coord_flip()

plotly::ggplotly(gen_inter) 
ggplot(data = by_major_category, aes(x = Major_category, y = College_jobs, fill = College_jobs)) +               geom_bar(stat = "identity") + 
  labs(x = "Major", y = "Amount of Jobs Need a Degree (2012)") + 
  scale_y_continuous(labels = comma_format()) +
  theme(legend.position = "none") +
  coord_flip()

> Based on the bar chart, we can clearly sees that Education, Engineering and Health industries requires a College Degree since it requires a lot of expertise. I think it does not change until today also.

Since I see there are some outliers, I want to see which major has high salary 9 years back.

head(recent_grads %>%
  arrange(desc(Median)) %>%
  select(Major, Major_category, Median))
##                                       Major Major_category Median
## 1                     PETROLEUM ENGINEERING    Engineering 110000
## 2            MINING AND MINERAL ENGINEERING    Engineering  75000
## 3                 METALLURGICAL ENGINEERING    Engineering  73000
## 4 NAVAL ARCHITECTURE AND MARINE ENGINEERING    Engineering  70000
## 5                      CHEMICAL ENGINEERING    Engineering  65000
## 6                       NUCLEAR ENGINEERING    Engineering  65000

Turns out it is Petroleum Engineering that has outliers. But there might be no surprise that the major has a very high salary and can be much higher nowadays.

ggplot(majors_processed, aes(Sample_size, Median)) +
  geom_point() +
  geom_text(aes(label = Major), check_overlap = TRUE, vjust = 1, hjust = 1) +
  scale_y_continuous()+
  scale_x_log10() 

Another aspect I want to explore is the Unemployment Rate for each Major Category. I think this is quite interesting to see 9 years ago which industry has a highest unemployment rate.

majors_processed %>%
  group_by(Major_category) %>%
  ggplot(aes(x = fct_reorder(Major_category, Unemployment_rate), y = Unemployment_rate)) +
  geom_bar(stat = "identity", fill = "pink") +
  coord_flip() +
  labs(title = "Unemployment Rate by Major Category", x = "Major Category", y = "Unemployment Rate")

I got to learned Linear Regression in one of my classes, I would like to apply the knowledge to the predictor which is: “ShareWomen” and the outcome as “Median”.

majors_processed %>%
  select(Major, Total, ShareWomen, Sample_size, Median) %>%
  lm(Median ~ ShareWomen, data = ., weights = Sample_size) %>%
  summary()
## 
## Call:
## lm(formula = Median ~ ShareWomen, data = ., weights = Sample_size)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -260500  -61042  -13899   33262  865081 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    52073       1436  36.255   <2e-16 ***
## ShareWomen    -23650       2403  -9.842   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 123000 on 170 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.363,  Adjusted R-squared:  0.3592 
## F-statistic: 96.87 on 1 and 170 DF,  p-value: < 2.2e-16

The analysis found that there is a statistically significant negative relationship between the proportion of female graduates and median earnings, meaning that as the proportion of female graduates in a major increases, the median earnings of graduates in that major decrease. Specifically, for each 1% increase in the proportion of female graduates, the median earnings decrease by an estimated $23,650, after adjusting for the sample size of each major. The adjusted R-squared value of 0.3592 indicates that about 36% of the variance in median earnings can be explained by the proportion of female graduates, after taking into account the sample size of each major. The residual standard error of $123,000 suggests that there is still a lot of unexplained variability in the data, meaning that other factors besides the proportion of female graduates are likely also influencing median earnings.

DISCUSSION

Overall, I think this dataset is very interesting to explore and wrangle since it is 9 years ago. There are a lot of interesting findings such as the unemployment rate 9 years ago which falls into Engineering industry, this is also the highest paid and most of them requires a College degree. I also applied Linear Regression model above to see the relationship between Median Salary with the amount of Female Graduates.

Reflection: I personally think this is an awesome activity to enhance and improve our R skills as well as data analysis skills. I also did learn new things throughout the course of exploring the dataset. What I would do different is that I would read and learn the dataset more coherently next time before dive into data wrangling and visualizing. Last but not least, I would choose appropriate color theme throughout the graphs in order to make the audience see comfortably.